AITopics | replication study

Automatic Classification of User Requirements from Online Feedback -- A Replication Study

Bhatt, Meet, Boilard, Nic, Chaudhary, Muhammad Rehan, Thompson, Cole, Idoko, Jacob, Sorathiya, Aakash, Ginde, Gouri

arXiv.org Artificial Intelligence, Jul-30-2025

Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), "Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning", which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, which is important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study's replication readiness, although providing the missing environment setup files would have enhanced it further. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.21532

Country: North America > Canada > Alberta > Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.89)

Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

Cui, Ziyan, Li, Ning, Zhou, Huaikang

arXiv.org Artificial Intelligence, Sep-3-2024

Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
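
The two success criteria in this abstract (agreement in direction and significance, versus the replicated confidence interval containing the original effect size) can diverge, which is why the 76.0 percent and 19.44 percent figures coexist. The sketch below illustrates the distinction only. It assumes a normal-approximation 95 percent interval built from a standard error; the abstract does not state how the intervals were computed, and the `Effect` class, function names, and numbers are hypothetical.

```python
from dataclasses import dataclass

Z_95 = 1.96  # assumption: normal-approximation critical value for a 95% interval


@dataclass
class Effect:
    original: float       # effect size reported by the original study
    replicated: float     # effect size estimated in the replication
    replicated_se: float  # standard error of the replicated estimate


def replicated_ci(e):
    """95% confidence interval around the replicated effect size."""
    half_width = Z_95 * e.replicated_se
    return (e.replicated - half_width, e.replicated + half_width)


def same_direction_and_significant(e):
    """True if the replication is significant and has the original's sign."""
    low, high = replicated_ci(e)
    significant = low > 0 or high < 0  # interval excludes zero
    same_sign = (e.original > 0) == (e.replicated > 0)
    return significant and same_sign


def ci_contains_original(e):
    """True if the replicated interval covers the original effect size."""
    low, high = replicated_ci(e)
    return low <= e.original <= high


def rate(flags):
    """Share of True values in a list of booleans."""
    return sum(flags) / len(flags)


# Hypothetical effects, not data from the paper.
effects = [
    Effect(original=0.30, replicated=0.35, replicated_se=0.05),
    Effect(original=0.20, replicated=0.60, replicated_se=0.10),  # inflated effect
    Effect(original=0.10, replicated=0.05, replicated_se=0.10),  # not significant
]

replication_rate = rate([same_direction_and_significant(e) for e in effects])
coverage_rate = rate([ci_contains_original(e) for e in effects])
print(replication_rate, coverage_rate)
```

The second hypothetical effect replicates in direction and significance, yet its interval misses the original value because the replicated effect is much larger, which is the pattern the abstract describes.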

effect size, original study, replication, (16 more...)

arXiv.org Artificial Intelligence

2409.00128

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Automatically Finding and Categorizing Replication Studies

de Ruiter, Bob

arXiv.org Artificial Intelligence, Nov-25-2023

In many fields of experimental science, papers that failed to replicate continue to be cited as a result of the poor discoverability of replication studies. As a first step to creating a system that automatically finds replication studies for a given paper, 334 replication studies and 344 replicated studies were collected. Replication studies could be identified in the dataset based on text content at a higher rate than chance (AUROC = 0.886). Additionally, successful replication studies could be distinguished from failed replication studies at a higher rate than chance (AUROC = 0.664).
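
For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance. A minimal sketch of that pairwise definition follows. The labels and scores are hypothetical, and nothing here reflects the classifier or features used in the paper.

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = replication study, 0 = other paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auroc(labels, scores))  # 8 of the 9 pairs are ranked correctly
```

The quadratic loop is fine for illustration; for large datasets a rank-based computation gives the same value more efficiently.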

original paper, replication, replication study, (15 more...)

arXiv.org Artificial Intelligence

2311.15055

Country:

Europe > Germany > Lower Saxony > Gottingen (0.15)
North America > United States > California > Santa Clara County > Palo Alto (0.05)

Genre: Research Report (0.56)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.33)

A prototype hybrid prediction market for estimating replicability of published work

Chakravorti, Tatiana, Fraleigh, Robert, Fritton, Timothy, McLaughlin, Michael, Singh, Vaibhav, Griffin, Christopher, Kwasnica, Anthony, Pennock, David, Giles, C. Lee, Rajtmajer, Sarah

arXiv.org Artificial Intelligence, Mar-1-2023

We present a prototype hybrid prediction market and demonstrate the avenue it represents for meaningful human-AI collaboration. We build on prior work proposing artificial prediction markets as a novel machine-learning algorithm. In an artificial prediction market, trained AI agents buy and sell outcomes of future events. Classification decisions can be framed as outcomes of future events, and accordingly, the price of an asset corresponding to a given classification outcome can be taken as a proxy for the confidence of the system in that decision. By embedding human participants in these markets alongside bot traders, we can bring together insights from both. In this paper, we detail pilot studies with prototype hybrid markets for the prediction of replication study outcomes. We highlight challenges and opportunities, share insights from semi-structured interviews with hybrid market participants, and outline a vision for ongoing and future work.
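
The idea that an asset price can serve as a confidence estimate can be illustrated with a small automated market maker. The abstract does not name a pricing rule, so the sketch below substitutes Hanson's logarithmic market scoring rule (LMSR) as a stand-in. The `LMSRMarket` class, its `liquidity` parameter, and the trade sizes are all hypothetical and are not taken from the paper.

```python
import math


class LMSRMarket:
    """Toy LMSR market maker (an assumption, not the paper's mechanism)."""

    def __init__(self, n_outcomes=2, liquidity=10.0):
        self.b = liquidity            # larger b means prices move more slowly
        self.q = [0.0] * n_outcomes   # shares sold so far for each outcome

    def cost(self, q=None):
        """LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
        q = self.q if q is None else q
        m = max(x / self.b for x in q)  # subtract the max for numerical stability
        return self.b * (m + math.log(sum(math.exp(x / self.b - m) for x in q)))

    def prices(self):
        """Current prices; they sum to 1 and can be read as confidences."""
        m = max(x / self.b for x in self.q)
        weights = [math.exp(x / self.b - m) for x in self.q]
        total = sum(weights)
        return [w / total for w in weights]

    def buy(self, outcome, shares):
        """Buy shares of one outcome and return the amount paid."""
        new_q = list(self.q)
        new_q[outcome] += shares
        paid = self.cost(new_q) - self.cost()
        self.q = new_q
        return paid


# Outcome 0 = "study will replicate", outcome 1 = "study will not replicate".
market = LMSRMarket()
before = market.prices()
paid = market.buy(outcome=0, shares=5.0)  # a human or bot trader backs outcome 0
after = market.prices()
print(before, paid, after)
```

In this toy setup, the price of outcome 0 after trading would be read as the market's confidence that the study replicates, regardless of whether the trade came from a human participant or a bot.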

artificial intelligence, machine learning, participant, (19 more...)

arXiv.org Artificial Intelligence

2303.00866

Country: North America > United States > Pennsylvania (0.04)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.69)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.34)

A Synthetic Prediction Market for Estimating Confidence in Published Work

Rajtmajer, Sarah, Griffin, Christopher, Wu, Jian, Fraleigh, Robert, Balaji, Laxmaan, Squicciarini, Anna, Kwasnica, Anthony, Pennock, David, McLaughlin, Michael, Fritton, Timothy, Nakshatri, Nishanth, Menon, Arjun, Modukuri, Sai Ajay, Nivargi, Rajal, Wei, Xin, Giles, C. Lee

arXiv.org Artificial Intelligence, Dec-23-2021

Explainably estimating confidence in published scholarly work offers opportunity for faster and more robust scientific progress. We develop a synthetic prediction market to assess the credibility of published claims in the social and behavioral sciences literature. We demonstrate our system and detail our findings using a collection of known replication projects. We suggest that this work lays the foundation for a research agenda that creatively uses AI for peer review.

johannesson, prediction market, replicability, (15 more...)

arXiv.org Artificial Intelligence

2201.06924

Country: North America > United States > Pennsylvania (0.04)

Genre: Research Report (0.85)

Industry: Banking & Finance > Trading (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

What is the Vocabulary of Flaky Tests? An Extended Replication

Camara, B. H. P., Silva, M. A. G., Endo, A. T., Vergilio, S. R.

arXiv.org Artificial Intelligence, Mar-23-2021

Software systems have been continuously evolved and delivered with high quality due to the widespread adoption of automated tests. A recurring issue hurting this scenario is the presence of flaky tests, test cases that may pass or fail non-deterministically. A promising approach, though one still lacking empirical evidence, is to collect static data from automated tests and use them to predict flakiness. In this paper, we conducted an empirical study to assess the use of code identifiers to predict test flakiness. To do so, we first replicate most parts of the previous study of Pinto et al. (MSR 2020). This replication was extended by using a different ML Python platform (Scikit-learn) and adding different learning algorithms in the analyses. Then, we validated the performance of trained models using datasets with other flaky tests and from different projects. We successfully replicated the results of Pinto et al. (2020), with minor differences using Scikit-learn; different algorithms had performance similar to the ones used previously. Concerning the validation, we noticed that the recall of the trained models was smaller, and classifiers presented a varying range of decreases. This was observed in both intra-project and inter-project test flakiness prediction.
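
As an illustration of what "using code identifiers" as static features might look like, the sketch below turns a test method into a bag of identifier tokens that a classifier could consume. The abstract does not describe the tokenization, so the splitting rules (camelCase and snake_case), the function name, and the sample test are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumed tokenization rules; the abstract does not specify them.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def identifier_tokens(source):
    """Split every identifier in the source into lowercase word tokens."""
    tokens = []
    for identifier in IDENTIFIER.findall(source):
        for chunk in identifier.split("_"):        # snake_case pieces
            for part in CAMEL_PART.findall(chunk):  # camelCase pieces
                tokens.append(part.lower())
    return tokens


# Hypothetical test method; note that keywords are kept as tokens too.
sample_test = """
public void testAsyncFetch() {
    Thread.sleep(100);
    assertEquals(expected_value, job.getResult());
}
"""

vocabulary = Counter(identifier_tokens(sample_test))
print(vocabulary.most_common(5))
```

Token counts like these could then be fed to any standard classifier as a bag-of-words vector; tokens such as `thread` and `sleep` are the kind of vocabulary one might expect to correlate with timing-dependent tests.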

classifier, flaky test, original study, (16 more...)

arXiv.org Artificial Intelligence

2103.12670

Country:

North America > United States > New York > New York County > New York City (0.04)
South America > Brazil > Paraná > Curitiba (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators

Borg, Markus, Abdessalem, Raja Ben, Nejati, Shiva, Jegeden, Francois-Xavier, Shin, Donghwan

arXiv.org Artificial Intelligence, Dec-12-2020

The increasing levels of software- and data-intensive driving automation call for an evolution of automotive software testing. As a recommended practice of the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate standard for safety of the intended functionality for road vehicles, simulation-based testing has the potential to reduce both risks and costs. There is a growing body of research on devising test automation techniques using simulators for Advanced Driver-Assistance Systems (ADAS). However, how similar are the results if the same test scenarios are executed in different simulators? We conduct a replication study of applying a Search-Based Software Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision detection system) using two different commercial simulators, namely, TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we compare critical test scenarios generated using our SBST solution in these two simulators. We show that SBST can be used to effectively and efficiently generate critical test scenarios in both simulators, and the test results obtained from the two simulators can reveal several weaknesses of the ADAS under test. However, executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs, in particular, related to (1) safety violations revealed by tests, and (2) dynamics of cars and pedestrians. Based on our findings, we recommend future V&V plans to include multiple simulators to support robust simulation-based testing and to base test objectives on measures that are less dependent on the internals of the simulators.

pro-sivic, scenario, simulator, (15 more...)

arXiv.org Artificial Intelligence

2012.06822

Country:

North America > United States > Massachusetts > Middlesex County > Burlington (0.04)
North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
Europe > Sweden > Skåne County > Lund (0.04)
Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Ground > Road (1.00)
Information Technology (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.67)

No, Wearing Red Doesn't Make You Hotter

Slate, May-12-2017, 15:30:10 GMT

What's been less reported is that the red-romance effect has been under scrutiny right from its early days. Now, after several follow-up studies, it seems likely that it does not hold up at all. Evidence for this comes from replication studies, scientific efforts that attempt to replicate an experiment to ensure that the previously found effect remains. One such replication study was published last week in Social Psychology by Robert Calin-Jageman and Gabrielle Lehmann of Dominican University in River Forest, Illinois. This study repeated the strongest experiment from Elliot's 2008 paper as closely as possible, having university students and online participants rate the same photos.

artificial intelligence, replication study, social media, (2 more...)

Slate

Country: North America > United States > Illinois (0.32)

Genre: Research Report (0.98)

Technology:

Information Technology > Communications > Social Media (0.85)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.85)
